Data Wrangling

Cory Whitney

Overview

  • Notes on R
  • Tidy code style using tidyr
  • Clean and intuitive functions using dplyr
  • Concise code using magrittr 'Ceci n'est pas une pipe'

Notes on R: About process

  • “[…] writing R code is a hedonistically artistic, left-brained, paint-in-your-hair sort of experience […]
  • learn how to code the same way we learned how to catch salamanders as children – trial and error, flipping over rocks till we get a reward […]
  • once the ecstasy of creation has swept over us, we awake late the next morning to find our canvas covered with 2100 lines of R code […]
  • Heads throbbing with a statistical absinthe hangover, we trudge through it slowly over days, trying to figure out what we did.”

Andrew MacDonald @polesasunder

Notes on R: Focus

Notes on R: Keeping track of work

Keep it tidy

Use ‘#’ to annotate code: commented lines are not run

If not R Markdown then at least mark sections with ‘----’ or ‘####’

# Section 1 ----

# Section 2 ####

# Section 3 ####

Sections appear as a TOC in the document outline (upper right of the source pane)

Notes on tidy R

Keep it tidy

Check your R version

version

The easiest way to get the libraries for today is to install the whole tidyverse:

install.packages("tidyverse")
library(tidyverse)

Notes on tidy R browseVignettes

Keep it tidy

Learn about the tidyverse with browseVignettes:

browseVignettes(package = "tidyverse")

The tidy tools manifesto

Notes on R: tidyR process

Keep it tidy

  • Good coding style is like correct punctuation:
  • withoutitthingsarehardtoread
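
A minimal illustration of the point, with made-up objects (not workshop data):

```r
# Hard to read: no spaces, cryptic names, everything crammed together
x<-c(1,2,3);m<-mean(x)

# Easier to read: spaces around operators, descriptive names, one step per line
scores <- c(1, 2, 3)
mean_score <- mean(scores)
```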

Notes on R: Keep your data tidy

Keep it tidy

  • Keep your data tidy
  • When your data is tidy, each column is a variable, and each row is an observation
  • Consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into the right form for different functions

Notes on R: Tidy Data

Three things make a dataset tidy:

  • Each variable with its own column.
  • Each observation with its own row.
  • Each value with its own cell.
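
A minimal sketch of the difference, using made-up data: in the untidy form, the year variable is spread across column names; tidyr::pivot_longer() gives each variable its own column.

```r
library(tidyr)

# Untidy: one column per year (a variable hidden in the column names)
untidy <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(12, 22),
  check.names = FALSE
)

# Tidy: country, year, and cases each get their own column;
# each row is now one observation (one country in one year)
tidy_cases <- pivot_longer(untidy,
                           cols      = c("2019", "2020"),
                           names_to  = "year",
                           values_to = "cases")
tidy_cases
```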

Wrangling: transform

  • Once you have tidy data, a common first step is to transform it
  • narrowing in on observations of interest
  • creating new variables that are functions of existing variables
  • calculating a set of summary statistics

www.codeastar.com/data-wrangling/

Wrangling: dplyr arguments

Format of dplyr

Every dplyr verb takes a data frame as its first argument

  • select: return a subset of the columns
  • filter: extract a subset of rows
  • rename: rename variables
  • mutate: add new variables or transform existing ones
  • group_by: split data into groups
  • summarize: generate tables of summary statistics
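
The six verbs can be sketched on the built-in mtcars data (a stand-in here, not the workshop data):

```r
library(dplyr)

# Each verb takes a data frame as its first argument
small   <- select(mtcars, mpg, cyl)          # subset of columns
heavy   <- filter(mtcars, wt > 3)            # subset of rows
renamed <- rename(mtcars, weight = wt)       # rename a variable
scaled  <- mutate(mtcars, kpl = mpg * 0.425) # add a derived column
by_cyl  <- group_by(mtcars, cyl)             # split into groups

# One summary row per group (cyl = 4, 6, 8)
cyl_means <- summarize(by_cyl, mean_mpg = mean(mpg))
cyl_means
```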

Getting your data in R

Load data

  • Load the data
participants_data <- read.csv("participants_data.csv")
  • Keep your data in the same folder structure as the .Rproj file
  • at or below the level of the .Rproj file
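
After loading, it is worth checking that the data came in as expected (a small stand-in data frame is used here, since the workshop file is not bundled):

```r
# Stand-in for the workshop file; with the real data use:
# participants_data <- read.csv("participants_data.csv")
participants_data <- data.frame(
  academic_parents      = c(0, 1, 2),
  working_hours_per_day = c(8, 11, 6)
)

str(participants_data)     # column names, types, and dimensions
head(participants_data)    # first few rows
summary(participants_data) # quick per-column summaries
```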

Wrangling: dplyr library

Using dplyr

library(dplyr)

and the others we need today:

library(knitr)
library(tidyr)
library(dplyr)
library(magrittr)

Roger Peng

genomicsclass.github.io/book/pages/dplyr_tutorial

Wrangling: dplyr::select aca_work_filter

Subsetting

Select

aca_work_filter <- select(participants_data, academic_parents, working_hours_per_day)

Wrangling: dplyr::select non_aca_work_filter

Subsetting

Select

non_aca_work_filter <- select(participants_data, -academic_parents, -working_hours_per_day)

Wrangling: dplyr::filter work_filter

Subsetting

Filter

work_filter <- filter(participants_data, working_hours_per_day > 10)

Wrangling: dplyr::filter work_name_filter

Subsetting

Filter

work_name_filter <- filter(participants_data, working_hours_per_day > 10 & letters_in_first_name > 6)

Wrangling: dplyr::rename name_length

Rename

participants_data <- rename(participants_data, name_length = letters_in_first_name)

Wrangling: dplyr::rename daily_labor

Rename

participants_data <- rename(participants_data, daily_labor = working_hours_per_day)

Wrangling: dplyr::mutate

Mutate

participants_data <- mutate(participants_data, labor_mean = daily_labor * mean(daily_labor))

Wrangling: dplyr::mutate

Mutate

Create a commute category

participants_data <- mutate(participants_data, commute = ifelse(km_home_to_zef > 10, "commuter", "local"))

Wrangling: dplyr::group_by

Group: split the data into commuters and non-commuters

commuter_data <- group_by(participants_data, commute)

Wrangling: dplyr::summarize

Summarize: get summary statistics of days to email response and name length

commuter_summary <- summarize(commuter_data, mean(days_to_email_response), median(name_length))

Wrangling: magrittr use

Pipeline %>%

  • Do all the previous with a pipeline %>%
pipe_data <- participants_data %>% 
   mutate(commute = ifelse(
     km_home_to_zef > 10, 
     "commuter", "local")) %>% 
  group_by(commute) %>% 
  summarize(mean(days_to_email_response), 
            median(name_length), 
            max(years_of_study)) %>% 
  as.data.frame()

Wrangling: magrittr try

Pipeline %>%

  • Work on your own with a pipeline %>%

  • Make your own query with dplyr and magrittr
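
One possible shape for such a query, sketched on a stand-in data frame with the column names used earlier in the slides (daily_labor, commute, name_length are assumptions here):

```r
library(dplyr)

# Stand-in data with the column names from the previous slides
participants_data <- data.frame(
  daily_labor = c(6, 9, 12, 7),
  commute     = c("local", "commuter", "commuter", "local"),
  name_length = c(4, 6, 8, 5)
)

# Filter, group, and summarize in one pipeline
example_query <- participants_data %>%
  filter(daily_labor > 8) %>%
  group_by(commute) %>%
  summarize(mean_name_length = mean(name_length))
example_query
```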

purrr: Apply a function to each element of a vector

library(purrr)

purrr cheatsheet

Using purrr

Use purrr to: split a data frame into pieces, fit a model to each piece, compute the summaries, then extract the R-squared values.

https://purrr.tidyverse.org/

http://varianceexplained.org/r/teach-tidyverse/

Using purrr for regression

Use purrr

library(purrr)

participants_data_regression <- 
  participants_data %>%
  split(.$batch) %>% # split() is from base R
  map(~ lm(days_to_email_response ~ daily_labor, 
           data = .)) %>%
  map(summary) %>%
  map_dbl("r.squared")

Tasks for the afternoon: Basic


  • Perform the above assessment of participants_data only where gender is 'F'
  • Use the magrittr pipeline to perform the tasks in short form

Tasks for the afternoon: Advanced


Work through the tasks on the diamonds data, in long form with base R and in short form with the magrittr pipeline:

  • select: carat and price
  • filter: only where carat is > 0.5
  • rename: rename price as cost
  • mutate: label each diamond 'expensive' or 'cheap' depending on whether its cost is above the mean cost
  • group_by: split into cheap and expensive
  • summarize: give summary statistics